Diagnostic accuracy of a machine learning-based radiomics approach of MR in predicting IDH mutations in glioma patients: a systematic review and meta-analysis

Objectives To assess the diagnostic accuracy of machine learning (ML)-based radiomics for predicting isocitrate dehydrogenase (IDH) mutations in patients with glioma. Methods A systematic search of PubMed, Web of Science, Embase, and the Cochrane Library from inception to 1 September 2023 was conducted to collect all articles investigating the diagnostic performance of ML for the prediction of IDH mutations in gliomas. Two reviewers independently screened all papers for eligibility. Methodological quality and risk of bias were assessed using the METhodological RadiomICs Score and Quality Assessment of Diagnostic Accuracy Studies-2, respectively. The pooled sensitivity, specificity, and 95% confidence intervals were calculated, and the area under the receiver operating characteristic curve (AUC) was obtained. Results In total, 14 original articles assessing 1740 patients with gliomas were included. The AUC of ML for predicting IDH mutation was 0.90 (0.87–0.92). The pooled sensitivity, specificity, and diagnostic odds ratio were 0.83 (0.71–0.90), 0.84 (0.74–0.90), and 25 (12–50), respectively. In subgroup analyses, modeling methods, glioma grade, and the combination of magnetic resonance imaging and clinical features affected the diagnostic performance in predicting IDH mutations in gliomas. Conclusion ML-based radiomics demonstrated excellent diagnostic performance in predicting IDH mutations in gliomas. Factors influencing the diagnosis included the modeling methods employed, glioma grade, and whether the model incorporated clinical features. Systematic review registration https://www.crd.york.ac.uk/PROSPERO/#myprospero, PROSPERO registry (CRD 42023395444).


Introduction
The 2016 World Health Organization (WHO) classification of central nervous system tumors incorporated molecular markers (1). The 2021 WHO classification emphasizes the role of molecular markers in both the classification and grading of gliomas (2). The primary markers for gliomas include isocitrate dehydrogenase (IDH), with tumors classified as IDH-mutant, 1p/19q-non-codeleted (IDHmut-Noncodel), and IDH wild-type (IDHwt). Patient outcomes and therapeutic options in glioma vary across subtypes (3,4). Patients with an IDH-mutated glioma have a better prognosis than those with an IDH wild-type tumor. Recent studies have demonstrated that IDH may be a potential therapeutic target for IDH-mutant gliomas (5). Therefore, preoperative prediction of IDH mutation status is important for prognosis and therapeutic decision-making. Although histopathology is the current diagnostic gold standard, it has some limitations, such as sampling errors, complications, and invasiveness. Thus, noninvasive assessment of IDH mutation status is an urgent requirement.
Radiomics can transform images into mineable data for quantitative analysis through high-throughput extraction and analysis, providing support for decision-making (6). Machine learning and deep learning combined with radiomics have excellent potential for preoperative diagnosis, staging, and therapeutic effect evaluation of gliomas (7,8), as well as for predicting IDH mutation status. A previous systematic review (9) addressed this subject, but it was not quantitative enough to evaluate the predictive performance. In addition, because radiomics research is a complicated process that includes multiple stages, it is critical to evaluate the quality of the method to ensure the reliability and reproducibility of the model before use in clinical work.
The aim of this systematic review and meta-analysis was to evaluate the accuracy of radiomics models in predicting the IDH status of gliomas and to evaluate the methodological quality and risk of bias in radiomics workflows.

Materials and methods
This meta-analysis was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (10) guidelines and registered in the PROSPERO registry (registration number, CRD 42023395444).

Literature search and study selection
The PubMed, EMBASE, Cochrane Library, and Web of Science databases were searched up to 1 September 2023 by two reviewers, C.X.L and Z.J. Only articles published in English were considered. The following keywords were used to identify relevant studies: ("Glioma" OR "Gliomas") AND ("Isocitrate Dehydrogenase" OR "IDH") AND ("MRI" OR "magnetic resonance imaging") AND ("machine learning" OR "radiomics" OR "deep learning" OR "Artificial Intelligence"). The details of the search strategies are provided in the Supplementary Materials.
The included articles fulfilled all the following criteria: 1) patients with pathologically confirmed glioma; 2) histopathological examination with the IDH mutation as a reference standard; 3) sufficient data for the reconstruction of 2×2 tables in terms of the diagnostic performance of MR-based radiomics in predicting the IDH status of glioma; and 4) original research articles. The exclusion criteria were as follows: 1) studies with fewer than 10 patients; 2) reviews, case reports, letters, and editorials; 3) studies not focusing on the diagnostic performance of MR-based radiomics in predicting IDH mutations; and 4) insufficient data for the reconstruction of 2×2 tables, or studies with overlapping cohorts. Two authors, C.X.L and Z.J, independently evaluated the eligibility of the included articles, and any disagreements were resolved via discussion with a third author (W.S.W, with 10 years of experience in neuroimaging).
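The third inclusion criterion requires enough data to rebuild a 2×2 table. A minimal sketch of one common way to do this, assuming a study reports sensitivity, specificity, and the numbers of IDH-mutant and IDH wild-type patients; the function name and the choice of IDH-mutant as the positive class are illustrative assumptions, not the authors' documented procedure:

```python
def reconstruct_2x2(sensitivity, specificity, n_mutant, n_wildtype):
    """Rebuild (TP, FN, FP, TN) from reported accuracy figures and group sizes.

    Assumption of this sketch: IDH-mutant is the positive class.
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    tp = round(sensitivity * n_mutant)    # mutant cases predicted as mutant
    tn = round(specificity * n_wildtype)  # wild-type cases predicted as wild-type
    fn = n_mutant - tp                    # mutant cases missed
    fp = n_wildtype - tn                  # wild-type cases called mutant
    return tp, fn, fp, tn


# Hypothetical study: 50 mutant and 40 wild-type patients.
table = reconstruct_2x2(0.80, 0.90, n_mutant=50, n_wildtype=40)
```

Rounding can shift a cell by one when the reported percentages were themselves rounded, so counts rebuilt this way are approximate.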

Quality assessment and data extraction
The methodological quality of the included articles and the risk of bias at the study level were assessed using the METhodological RadiomICs Score (METRICS) (12) and the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool (11), respectively. The QUADAS-2 tool included four parts: (a) patient selection, (b) index test, (c) reference standard, and (d) flow and timing. The METRICS tool included 30 items within 9 categories for evaluating the quality of the radiomics workflow. Two reviewers, C.X.L and G.L.B, assessed the quality of the articles separately and resolved any disagreements through discussion with a third author (W.S.W).
The following data were extracted from the included articles: 1) study characteristics (authors, year of publication, country of origin, study design (prospective vs. retrospective)); 2) patient and clinical characteristics (number of patients, age, WHO grade, reference standard); 3) technical characteristics of magnetic resonance imaging (MRI) (magnetic field strength (T), scanner, scan sequence) and machine learning details (classifier, method of segmentation, VOI or ROI, and external or internal validation).

Statistical analysis
This meta-analysis was performed using Stata 16, Review Manager 5.3, and Meta-disc 1.4. Pooled sensitivity, specificity, diagnostic odds ratio (DOR), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) with 95% confidence intervals (CIs) were calculated using a bivariate random-effects model, and a summary receiver operating characteristic (SROC) curve and area under the curve (AUC) were generated to illustrate the diagnostic performance.
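The accuracy measures named above all derive from a study's 2×2 table. The following is a minimal per-study sketch using the standard definitions; it is not the bivariate random-effects pooling performed in Stata, and the function name and the 0.5 continuity correction for zero cells are assumptions of this example:

```python
def diagnostic_metrics(tp, fn, fp, tn, correction=0.5):
    """Per-study sensitivity, specificity, PLR, NLR, and DOR from a 2x2 table."""
    # Assumption: add a continuity correction to every cell only when a cell
    # is zero, so that the ratios below stay finite.
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (x + correction for x in (tp, fn, fp, tn))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "plr": sens / (1 - spec),        # positive likelihood ratio
        "nlr": (1 - sens) / spec,        # negative likelihood ratio
        "dor": (tp * tn) / (fp * fn),    # diagnostic odds ratio = PLR / NLR
    }


# Hypothetical counts, not taken from any included study.
example = diagnostic_metrics(tp=40, fn=10, fp=4, tn=36)
```

As a rough consistency check, a sensitivity of 0.83 and a specificity of 0.84 give (0.83/0.17) × (0.84/0.16) ≈ 25.6, close to the pooled DOR of 25 reported in the abstract, although the bivariate model does not compute the pooled DOR in exactly this way.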
The heterogeneity of the included studies was assessed using the Q-test (p value ≤ 0.05) and the I² statistic (>50%) (13). A Spearman coefficient >0.6 indicated a threshold effect (14). Subgroup analysis was performed to further investigate the potential causes of heterogeneity, and the following four covariates were included: 1) machine learning (ML) vs. deep learning (DL), 2) only radiomics vs. combination of radiomics and clinical information, 3) low-grade glioma (LGG) vs. high-grade glioma (HGG), and 4) external validation vs. internal validation.
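The Q-test and I² statistic can be illustrated with the generic inverse-variance formulas. This is a sketch under stated assumptions: the effect sizes are taken to be per-study estimates (for example, logit sensitivities) with known variances, and the function name is hypothetical. The software used in the review may compute these quantities differently.

```python
def cochran_q_i2(estimates, variances):
    """Cochran's Q, its degrees of freedom, and Higgins's I^2 (in percent)."""
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Q is the weighted sum of squared deviations from the fixed-effect mean.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2 is the share of total variation beyond chance, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical effect sizes with unit variances.
q, df, i2 = cochran_q_i2([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
```

Under the thresholds stated above, an I² above 50% would be read as substantial heterogeneity.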

Quality assessment
The risk of bias and applicability assessments of the included studies, performed using the QUADAS-2 tool, are shown in Figure 2. In terms of patient selection, two studies (17,28) were deemed to have a low risk of bias, six (15,18,23,25-27) exhibited a high risk of bias owing to unclear information regarding the time range and consecutive enrollment of patients, and six (16,19-22,24) were considered to have an unclear risk because it was uncertain whether patients were enrolled consecutively. Regarding the index test, 13 studies had an unclear risk of bias owing to ambiguity regarding the use of a threshold. All the studies had a low risk of bias for the reference standard. Regarding flow and timing, 13 studies had an unclear risk of bias because there was no mention of the time interval between imaging and molecular testing.
The mean METRICS score of the included studies was 60.3% (range, 50%-75%). The quality of six studies (15-17, 22, 25, 28) was moderate (40≤score<60%), and that of eight studies (17-21, 23, 24, 26) was good (60≤score<80%). The highest score of 75% was obtained in one study (26) and the lowest score of 50% was observed in two studies (22,25), primarily attributed to a lack of a validation cohort. The item "Model availability" was assigned zero points as none of the included studies addressed it. Only one study (24) publicly shared the code. A detailed description of the METRICS scores is provided in Table 3.

FIGURE 1
Flow chart of study selection.

FIGURE 2
Summary of the risk of bias and applicability assessments: the authors' judgements for each domain of each included study were reviewed. The proportions of included studies with low, unclear, or high risk and applicability concerns are shown in green, yellow, and red, respectively.
Cochran's Q test showed significant heterogeneity (Q=25.320, p=0.00) across the studies, with a Higgins's I² statistic of 79% for sensitivity and 74.1% for specificity. The Spearman correlation coefficient between the sensitivity and false-positive rate was 0.525 (p=0.054), which indicated no threshold effect among the included studies.
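The threshold-effect check can be sketched as a Spearman rank correlation between per-study sensitivity and false-positive rate, compared against the 0.6 cutoff stated in the Methods. The helper names below are hypothetical and the input values are illustrative, not data from the included studies. Because the coefficient is rank-based, computing it on logit-transformed values gives the same result.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def threshold_effect(sensitivities, false_positive_rates, cutoff=0.6):
    """Return (rho, flag); flag is True when rho exceeds the cutoff."""
    rho = spearman(sensitivities, false_positive_rates)
    return rho, rho > cutoff


# Illustrative values only.
rho, flagged = threshold_effect([0.70, 0.80, 0.90, 0.85], [0.10, 0.20, 0.30, 0.25])
```

With the reported coefficient of 0.525, the comparison against 0.6 is false, which matches the conclusion that no threshold effect was present.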

Discussion
This systematic review and meta-analysis evaluated the diagnostic performance of radiomics in predicting IDH mutations. The pooled sensitivity, specificity, and AUC were 0.83 (95% CI, 0.71-0.90), 0.84 (95% CI, 0.74-0.90), and 0.90 (95% CI, 0.87-0.92), respectively. This indicates that radiomics combined with ML and DL could be an effective and accurate diagnostic tool for predicting IDH mutations in gliomas. Substantial heterogeneity was noted in the specificity (I²=79.6%) and sensitivity (I²=74.1%). Thus, we performed subgroup analyses to explore the sources of the heterogeneity, which included the modeling method (ML vs. DL), glioma grade, whether the model incorporated clinical features, and the validation method (external vs. internal validation).
The results of the present meta-analysis showed that studies using ML had better diagnostic performance than those using DL. This could be attributed to the small sample sizes of the included studies. DL is capable of training multi-layer deep neural networks, which show significant potential in handling very large datasets with thousands or even millions of instances; however, when the dataset is small, DL tends to perform worse than ML. Similar findings have been previously reported for ML in other studies (29,30). However, only two of the included studies used DL; thus, future work should incorporate a greater number of studies with sufficient datasets to explore its true diagnostic capabilities.
A previous study (31) demonstrated that a model combining magnetic resonance (MR) and clinical features with ML exhibits better diagnostic performance than one using only MR features. Clinical features such as age, sex, and exposure to ionizing radiation are closely related to the pathological process of glioma (32,33). For example, age is a risk factor for the development of high-grade glioma; young patients are more likely to have IDH1-mutant glioma, and their postoperative survival and clinical prognosis may be more favorable (20). Our findings are consistent with the previous study; therefore, we recommend the combined use of MR and clinical features with ML in future radiomics studies to verify their true diagnostic capabilities in predicting IDH mutation status in gliomas.
The diagnostic performance in predicting the IDH mutation status of HGG was better than that of LGG in the present study, which is consistent with a previous meta-analysis (31); however, more studies are required to validate this conclusion, given the limited number of included studies. Additionally, we found that studies using external validation had a diagnostic performance similar to that of studies using internal validation, which suggests that the models were stable. Internal validation tends to overestimate the diagnostic value owing to the model's lack of generalizability (34); thus, externally validated prediction models are required to reliably estimate diagnostic performance on other datasets.
METRICS is a new quality assessment tool that includes 30 items within 9 categories to evaluate the key steps in the radiomics research workflow. It was recently developed by a large group of international experts in the field, is easy to use, and is specifically aimed at improving the methodological quality of radiomics research. The METRICS score of the included studies ranged from 50% to 75%, and the mean score was 60.3%. The quality of 6 studies was moderate (40≤score<60%) and that of 8 studies was good (60≤score<80%). For the items with the highest weights, such as high-quality reference standards with a clear definition and eligibility criteria that describe a representative study population, all the included studies received a full score. Only one study (24) publicly shared the code and two studies (25,26) publicly shared the data; however, these two items, which belong to the "open science" category, had the lowest weight. Although METRICS is a valuable tool for evaluating the quality of radiomics research, it is not without limitations. Further revision of METRICS may enhance its comprehensiveness in assessing the quality of radiomics studies.
The QUADAS-2 quality assessment revealed other issues in the included studies that can be avoided in future investigations. For example, the majority of the studies did not mention whether patients were enrolled consecutively or the time interval between imaging and molecular testing, which led to a high or unclear risk of bias. In 13 studies, it remained unclear whether thresholds were pre-specified, potentially resulting in an overestimation of the diagnostic performance of the models.
This study had several limitations. First, most of the included studies had a retrospective design, and only one had a prospective design; thus, selection bias was inevitable. Therefore, larger prospective multicenter studies are required to validate our findings. Second, the sample sizes of the included studies were not large enough for training and validation, which limited the statistical power of the study and may affect the generalizability of the results. Third, significant heterogeneity was observed, as is also the case in other meta-analyses of diagnostic accuracy using ML based on radiomics. Finally, the mean METRICS score of the included studies was 60.3%, indicating moderate overall quality. Therefore, further high-quality radiomics studies are required to verify our results. Despite these limitations, our review provided new insights into the accuracy of ML-based radiomics models for predicting IDH mutations in gliomas.
In conclusion, ML-based radiomics demonstrated excellent diagnostic performance for predicting IDH mutations in gliomas. Nevertheless, owing to the limitations in the quality and quantity of the included studies, caution should be exercised when applying the results, and more standardized and prospective studies are warranted to improve the application and reliability of radiomics.

FIGURE 3
Coupled forest plots of the pooled sensitivity and specificity for the diagnostic performance of machine learning-based radiomics for the prediction of IDH mutation in glioma.

FIGURE 4
Hierarchical summary receiver operating characteristic (SROC) curve of the diagnostic performance of machine learning-based radiomics for the prediction of IDH mutation in glioma.

TABLE 1
Basic characteristics and details of the 14 included studies (1).
NA, not available.

TABLE 2
Basic characteristics and details of the 14 included studies (2).

TABLE 2 Continued
NA, not available.

TABLE 3
METRICS of the included studies.
#1 Adherence to radiomics and/or machine learning-specific checklists or guidelines; #2 Eligibility criteria that describe a representative study population; #3 High-quality reference standard with a clear definition; #4 Multi-center; #5 Clinical translatability of the imaging data source for radiomics analysis; #6 Imaging protocol with acquisition parameters; #7 The interval between imaging used and reference standard; #8 Transparent description of segmentation methodology; #9 Formal evaluation of fully automated segmentation; #10 Test set segmentation masks produced by a single reader or automated tool; #11 Appropriate use of image preprocessing techniques with transparent description; #12 Use of standardized feature extraction software; #13 Transparent reporting of feature extraction parameters, otherwise providing a default configuration statement; #14 Removal of non-robust features; #15 Removal of redundant features; #16 Appropriateness of dimensionality compared to data size; #17 Robustness assessment of end-to-end deep learning pipelines; #18 Proper data partitioning process; #19 Handling of confounding factors; #20 Use of appropriate performance evaluation metrics for task; #21 Consideration of uncertainty; #22 Calibration assessment; #23 Use of uni-parametric imaging or proof of its inferiority; #24 Comparison with a non-radiomic approach or proof of added clinical value; #25 Comparison with simple or classical statistical models; #26 Internal testing; #27 External testing; #28 Data availability; #29 Code availability; #30 Model availability.